WHAT IS THE SIGNIFICANCE OF NEURAL NETWORKS FOR AI?

Abstract 

Associative memory (AM) and attentive associative memory (AAM) have been
reviewed in terms of simple neural networks (both uniform and non-uniform
matched filter banks, read by inner products and written by outer products
in parallel). Whereas AM has been applied to optical character recognition
(OCR) using the set of orthogonal feature vectors deduced from image
processing and computer vision, AAM can incorporate AI expert system
techniques for determining the non-uniform linear combination of outer
products. A rule-based system can more efficiently incorporate the
frequency distribution of distorted characters according to user group
profiles, e.g. left-handed versus right-handed writing. Specifically, in
this paper we have examined the degree of fault tolerance in AM, the
ability of generalization by interpolation (auto-associative memory), and
abstraction by extrapolation (hetero-associative memory). The efficiency of
the closed system of rule-based knowledge representation of AI using tuple
storage has been combined with the flexibility of the non-rule-based open
system using the matrix knowledge representation of NI (coined for either
Neural, or Network, or Natural Intelligence). Thus, the ability of
generalization and abstraction becomes possible in a combined intelligent
system of AI and NI.

1. INTRODUCTION 

The question of the significance of neural networks for AI may be subdivided into three
aspects.


(i) How can neural networks help solve AI problems?

ANSWER: Both the well understood fault tolerance of associative memory (AM) and the
less understood ability of neural networks for generalization and abstraction can be usefully
incorporated into AI techniques.

(ii) How can AI help solve neural network problems?

ANSWER: Similar to computer aided design, AI expert systems with neural network
modules can help design special purpose architectures for neural network computing.

(iii) What unsolved problems can be solved efficiently by combining AI and NI (coined for
either Neural, or Network, or Natural Intelligence) techniques to utilize their respective strengths?

ANSWER: Optical character recognition (OCR) for reading hand-written bank checks
and zip codes can be solved by combining both AI and NI techniques, as described in this paper.


Because we can only build a small neural network, we wish to endow a small set of
neurons with a human-like intelligence. With present technology, whether it be electronic or
optical, one cannot build a neural network of more than several hundred neurons using existing
processor elements (PEs), because of the technological difficulty associated with dense
interconnectivity, about N^2 for N PEs. Thus, artificial neural networks cannot yet match the size
and the complexity of the human brain, which has billions of neurons and thousands of
interconnects for each neuron. If we are not overly ambitious in developing a general purpose
neural computer, we can build a special purpose neural computer for solving special purpose
problems, such as OCR.

One way to accomplish this special purpose neural computer is to combine the traditional
rule-based AI wisdom with non-rule-based NI learning. This is particularly desirable in solving
OCR problems because the available small neural networks can use better feature vectors obtained
from other disciplines. Neural networks, built with current technology, can then provide fault
tolerance for variations in the input feature vectors. The specific problem of hand-written character
recognition differs from the more regular, hand-printed, alphanumeric recognition problem in
that it must account for such complications as connected characters and characters broken by
segmentation.

Conceptually, one could solve the OCR problem using analytic, rule-based AI, or neural
network techniques. The OCR problem can be subdivided into character (or character string)
statistics, font recognition, and character recognition; the most efficient techniques for these three
subproblems are analytic (statistical), rule-based AI, and neural networks, respectively. Since the
statistical techniques applied to alphanumeric frequencies are well known, this topic will not be
discussed further. In solving the font recognition subproblem, AI rules can be set by the (statistical)
frequency distribution of individual distorted characters according to user group profiles, e.g. left-
handed writing versus right-handed writing. It is efficient to design an AI expert system that draws
upon classical statistical pattern recognition; e.g. a one-stroke difference exists between "P" and
"R", or between "O" and "Q", or, from a low-pass filter viewpoint, only a one-stroke location
difference exists among the four rounded letters "P", "R", "O", and "Q". Furthermore, the
rules of pair character distortion distribution can help solve the problem of connected characters
and broken characters after segmentation, such as two scripted zeros. The pair character correlation
matrix can be analyzed by the technique of the Karhunen-Loeve procedure in image processing.
The Karhunen-Loeve technique is compatible with AM's outer product decomposition. With the
help of an AI rule-based system, both the first and the second order statistics can be incorporated in the
formalism of attentive associative memory (AAM), which possesses the extra degrees of freedom of
the non-uniform storage of vector outer products based on a given set of critical feature vectors.

Because the open-ended knowledge of input pattern variations may be efficiently
controlled by using knowledge from other disciplines, such as AI and computer vision, with a resulting
better combined technology, we shall review AM and AAM, and various OCR approaches by
means of their specific techniques used for feature extraction and for group
classification. The sooner we accept the implementation limitations of present neurocomputers,
the better we can work with researchers in other disciplines, for example AI, computer vision,
and image processing. Since this cross-disciplinary collaboration is by nature not easy because of
the different trainings and languages involved, this paper may serve as a door opener for both.

Pattern recognition researchers have been more successful in machine-printed character
recognition (CR) than in optical character recognition (OCR) of hand-written bank checks or
zip codes. Difficulties of applying AI alone to an intelligent OCR may be due to the lack of a non-
rule-based capability of generalization and abstraction. This may be constrained by the traditional
AI one-dimensional (1-D) knowledge representation, e.g. an ordered set of tuples used in semantic
networks. Similarly, difficulties of applying the neural network alone to an intelligent OCR may
lie in selecting critical features, which is precisely one of the most challenging and unsolved problems
(others are segmentation and location). On the other hand, AI is efficient in reducing the problem
to a sub-problem based on a 1-D knowledge representation of simple rules, and NI provides the
fault-tolerant OCR system based on a 2-D knowledge representation. Together they give the possibility
of generalization and abstraction. Thus, Szu and Tan (1988) have considered a less risky approach
that brings together the traditional AI researchers who know about OCR critical features and the neural
network experts who know about AM fault tolerance. Technological developments have pointed
to the readiness of such collaborations, since 2-D storage by chips or optical disks becomes cheaper
than the traditional 1-D content addressable memory (CAM) processor. What is needed is a smart
coprocessor such as a neurocomputer. As a matter of fact, due to the 2-D nature of light, optical
expert systems based on AM have been designed by Szu and Caulfield (1987), who have shown that,
by a simple replacement of 1-D tuples by 2-D matrices in a semantic network, the alias problem for data
fusion is solved by matrix addition and thresholding. The opto-electronic implementation of the
attentive associative memory model of Athale, Szu & Friedlander (1986) can be expanded by
means of an a priori probability compiled by a pair-character correlation function of script letters.
These papers may give both sides a starting line for collaboration.

In this paper, we have reviewed the orthogonal subspaces of features and examined (1) the
degree of fault tolerance, (2) the generalization by interpolation to other orthogonal feature vectors
within the subspace, and (3) the abstraction by extrapolation to other subspaces. AAM may be
formulated by a linear combination of outer products based on a set of orthogonal feature vectors.
The combination coefficient is called the attention parameter, because it enters into the eigenvalue
of the AAM matrix that governs the recall convergence. We briefly review the dynamics of
attentive associative memory published elsewhere by Szu (1988) using arbitrary coefficients. In
this paper we explicitly introduce an AI-tuple for the attention vector $a = \{a_n,\ n = 1, \ldots, M\}$,
where the inner product of a fixed memory state $|m\rangle$ with the difference vector between that
state and an averaged stochastic input $|Q\rangle$ is naturally used as the attention parameter, defined
in terms of Dirac's inner product notation: $a_m = \langle m|m\rangle - \langle m|Q\rangle$. Such an AAM matrix
has a non-white eigenvalue spectrum $\lambda_n = a_n - (A/B)$, where the attentive memory capacity is
$A = \sum_{n=1}^{M} a_n$ and B is the length of the feature vectors (e.g. the number of bits). Iterative
recalls are used. Paying non-uniform attention ($a_n > 1$) increases the memory capacity, A > M,
together with a faster convergence rate proportional to the larger eigenvalue, $\lambda_m \ge \lambda$, than
uniform attention (i.e. $a_m = 1$). Szu's (1988) analysis has suggested that the eigenvalue spectrum
and its dithering by the input ensemble can play a crucial role in the convergence associated with
a nonlinear dynamical system.


2. ASSOCIATIVE MEMORY

Matrix associative memory works like a parallel bank of matched filters, but much more
efficiently on at least three counts: (1) no address coding of input and decoding of output is
necessary, (2) operations are done in parallel, and (3) the connectivity matrix can be determined
by itself using various adaptive (learning) algorithms.

An analytical and numerical example of AM is given as follows: 

We denote M feature vectors as binary words, $U^{(m)}$, m = 1, ..., M. Each word has B bits. The
inner product of Eq (1) measures the norm, the number of bits that are one.

    $U^{T} \cdot U = \#\ \text{of ones}$    (1)

where the superscript T transposes the column vector to a row vector.

The associated bipolar words, denoted by $V^{(m)}$, m = 1, ..., M, are defined as follows:

    $V = (2U - \mathbf{1}) = \mathrm{Sgn}(U)$    (2)

where the unit vector 1 has all entries equal to 1 and Sgn is the sign function that changes zero and
negative quantities to -1. We prefer the bipolar version to the binary version because: (1) the inner
product norm is always identical to the number of bits, B:

    $V^{T} \cdot V = B = \langle V | V \rangle$,    (3)

rewritten here in terms of Dirac's bracket notation, <bra|ket> for the inner and |ket><bra| for the
outer product; (2) the nature of "exclusive or" can be easily represented by bipolar multiplication,

    +1 x +1 = 1,  -1 x -1 = 1,  +1 x -1 = -1,  -1 x +1 = -1;

(3) the inner product norm is related to the Hamming distance, defined to be the number of
different bits between two vectors no matter where the differences occur.
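
The three properties above can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `to_bipolar` and `hamming` are illustrative and do not come from the paper. It uses the standard identity that, for bipolar words of B bits, the inner product equals B minus twice the Hamming distance.

```python
import numpy as np

def to_bipolar(u):
    """Map a binary word (0/1 entries) to its bipolar form, V = 2U - 1 (Eq 2)."""
    return 2 * np.asarray(u) - 1

def hamming(u, v):
    """Number of positions in which two words differ."""
    return int(np.sum(np.asarray(u) != np.asarray(v)))

u13 = np.array([1, 1, 0, 1])   # binary word 13
u11 = np.array([1, 0, 1, 1])   # binary word 11
v13, v11 = to_bipolar(u13), to_bipolar(u11)
B = len(u13)

binary_norm = int(u13 @ u13)    # number of ones, Eq (1)
bipolar_norm = int(v13 @ v13)   # always equal to B, Eq (3)
inner = int(v13 @ v11)          # equals B - 2 * Hamming distance

print(binary_norm, bipolar_norm, inner, hamming(u13, u11))
```

Two bipolar words are therefore orthogonal exactly when they differ in half of their bits, which is the case for words 13 and 11 here.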

We assume an orthogonal set of feature vectors defined as follows:

    $V^{(n)T} \cdot V^{(m)} = B\,\delta_{nm} = \langle n | m \rangle$    (4)

where $\delta_{nm}$ is the Kronecker delta. The outer product weight matrix W represents an
associative memory:

    $[W] = \sum_m V^{(m)} V^{(m)T} = \sum_m |m\rangle\langle m|$    (5)


Hopfield (1982, 1984) assumed the auto-associative matrix [T] to be traceless. That was used
together with the symmetry property to prove convergence. Thus, the second term, the Kronecker
delta matrix (1's along the main diagonal and zeros elsewhere), is introduced in Eq (6) to make it
traceless.

    $B\,[T]_{ij} = [W]_{ij} - M\,\delta_{ij}$    (6)

B is the normalization constant, and M is the memory capacity. Using the trace operation denoted
by Tr, we can easily verify Eq (6) to be traceless.

    $\mathrm{Tr}(|m\rangle\langle m|) = B$    (7)

    $\mathrm{Tr}([\delta_{ij}]) = B$    (8)

The tradeoff between the memory capacity and the degree of fault tolerance has been estimated to
be about 15% of B bits [Hopfield (1982)] for pseudo-orthogonal vectors. That is,

    $M = 0.15\,B$    (9)

For orthogonal feature vectors, however, the capacity is 100%.

    $M = B$    (10)

This fact can be demonstrated by the eigenvalue problem of the matrix, which is defined to be

    $[T]\,|n\rangle = \lambda_n\,|n\rangle$    (11)

where the eigenvalue can be easily verified, using Eqs (4) and (6), to be degenerate, namely, a
white spectrum for all M states,

    $\lambda_n = 1 - (M/B)$    (12)

The full capacity, M = B, corresponds to a zero eigenvalue for all B orthogonal eigenstates, one for
each feature vector.
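
As a numerical illustration of Eqs (5), (6), (11) and (12), the sketch below builds the traceless matrix for two orthogonal four-bit bipolar vectors and checks the white spectrum. It assumes NumPy; the function name `traceless_memory` is hypothetical and chosen only for this example.

```python
import numpy as np

def traceless_memory(vectors):
    """Return [T] = (sum_m |m><m| - M * identity) / B for bipolar row vectors (Eq 6)."""
    V = np.asarray(vectors, dtype=float)
    M, B = V.shape
    W = V.T @ V                       # sum of outer products, Eq (5)
    return (W - M * np.eye(B)) / B

stored = np.array([[1, 1, -1, 1],     # two mutually orthogonal bipolar vectors
                   [1, -1, 1, 1]])
M, B = stored.shape
T = traceless_memory(stored)
lam = 1 - M / B                       # degenerate eigenvalue of Eq (12)

print("trace =", np.trace(T))
for v in stored:
    print(T @ v, "vs", lam * v)
```

With M = 2 and B = 4 both stored vectors are eigenvectors with the same eigenvalue 1 - M/B = 0.5; storing all B = 4 orthogonal vectors would drive that eigenvalue to zero.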

Consider a simple example where B = 4. There are 4 possible orthogonal vectors and 2^4 =
16 possible words denoted by:

    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15


We introduce orthogonal subspaces defined by the number p of contiguous 1's in the binary
word. The subspace consisting of words 13, 11, 7, and 14 is obviously orthogonal: in the complement
words, a "one" is shifted among 3 zeroes from the left to the right end of the word.


    Word   Binary    Complement   Binary    p

     13     1101          2        0010     3
     11     1011          4        0100     3
      7     0111          8        1000     3
     14     1110          1        0001     3
     15     1111          0        0000     4
      6     0110          9        1001     2
     12     1100          3        0011     2
     10     1010          5        0101     1


    Word   Bipolar         Complement   Bipolar         p

     13    +1 +1 -1 +1          2       -1 -1 +1 -1     3
     11    +1 -1 +1 +1          4       -1 +1 -1 -1     3
      7    -1 +1 +1 +1          8       +1 -1 -1 -1     3
     14    +1 +1 +1 -1          1       -1 -1 -1 +1     3
     15    +1 +1 +1 +1          0       -1 -1 -1 -1     4
      6    -1 +1 +1 -1          9       +1 -1 -1 +1     2
     12    +1 +1 -1 -1          3       -1 -1 +1 +1     2
     10    +1 -1 +1 -1          5       -1 +1 -1 +1     1

It is readily verified that the bipolar words of the subspace (13, 11, 7, 14) are mutually
orthogonal to one another, as shown in Figure 1. They happen to be related to the Walsh functions
of periodicity p = 3. The corresponding binary words have an equal angle among them, cos^-1(2/3),
that is not 90°. The bipolar words of the second subspace (15, 6, 12, 10) are also orthogonal, but the
two subspaces are not orthogonal to each other.
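
These orthogonality statements can be checked directly. The following is a minimal sketch, assuming NumPy; the helpers `binary`, `bipolar` and `gram` are illustrative names, and words are read most significant bit first, as in the tables.

```python
import numpy as np

def binary(word, bits=4):
    """Binary vector of an integer word, most significant bit first."""
    return np.array([(word >> (bits - 1 - k)) & 1 for k in range(bits)])

def bipolar(word, bits=4):
    """Bipolar vector of an integer word: 1 -> +1, 0 -> -1."""
    return 2 * binary(word, bits) - 1

def gram(words, encode):
    """Matrix of all pairwise inner products for the encoded words."""
    V = np.array([encode(w) for w in words])
    return V @ V.T

first = [13, 11, 7, 14]
second = [15, 6, 12, 10]

G1 = gram(first, bipolar)       # expected: 4 * identity
G2 = gram(second, bipolar)      # expected: 4 * identity
cross = np.array([[bipolar(a) @ bipolar(b) for b in second] for a in first])
cos_binary = gram(first, binary)[0, 1] / 3.0   # each binary word has squared norm 3

print(G1, G2, cross, cos_binary, sep="\n")
```

The cross inner products between the two subspaces are all nonzero, and the binary words of the first subspace meet at cos^-1(2/3) rather than at a right angle.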



Figure 1. Two-Dimensional Representation of Walsh Base Functions 
Used to illustrate the fault tolerance and generalization properties 
of Associative Memory 


We consider the storage of one word in memory.

    $4\,[T_1] = [13] = |13\rangle\langle 13| - \delta$    (13)

If the outer product is properly normalized, it is related to the projection operator:

    $[P] = \delta - |13\rangle\langle 13|\,(1/B)$    (14)

Using Eq (4), it can be verified that

    $[P]^2 = [P]$.    (15)

We will show (1) the ability of fault tolerance, and (2) the ability for generalization.

Fault Tolerance

The following sequence of successively erasing (zeroing out) bipolar bits illustrates
tolerance of missing bits.

(1) one missing bit

    $[13]\,(0,\ 1,\ -1,\ 1)^T = \mathrm{Sgn}(3,\ 2,\ -2,\ 2)^T = |13\rangle$    (16)

where Sgn is the sign function representing the sigmoid neuron response by the point nonlinearity
extracting the algebraic sign of each entry.

(2) two missing bits

    $[13]\,(0,\ 0,\ -1,\ 1)^T = \mathrm{Sgn}(2,\ 2,\ -1,\ 1)^T = |13\rangle$    (17)

(3) three missing bits

    $[13]\,(0,\ 0,\ 0,\ 1)^T = \mathrm{Sgn}(1,\ 1,\ -1,\ 0)^T = |12\rangle$

    $[13]^2\,(0,\ 0,\ 0,\ 1)^T = \mathrm{Sgn}(1,\ 1,\ -1,\ 3)^T = |13\rangle$    (18)

(4) four missing bits

    $[13]\,(0,\ 0,\ 0,\ 0)^T = \mathrm{Sgn}(0,\ 0,\ 0,\ 0)^T = (-1,\ -1,\ -1,\ -1)^T = |0\rangle$

    $[13]^2\,(0,\ 0,\ 0,\ 0)^T = \mathrm{Sgn}(-1,\ -1,\ 3,\ -1)^T = |2\rangle$

    $[13]^3\,(0,\ 0,\ 0,\ 0)^T = \mathrm{Sgn}(-3,\ -3,\ +3,\ -3)^T = |2\rangle$    (19)

which converges to a fixed point that is precisely the bipolar complement to |13>. In other words,
the phase information is lost as an overall minus sign in the last case.
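
The missing-bit recalls of Eqs (16) to (19) can be replayed with a few lines of code. This is a sketch under stated assumptions: NumPy is used, the helper names `sgn`, `recall` and `word` are hypothetical, the update is synchronous (all bits at once), and Sgn maps zero to -1 as in the text.

```python
import numpy as np

def sgn(x):
    """Point nonlinearity of the text: +1 for positive entries, -1 for zero or negative."""
    return np.where(np.asarray(x) > 0, 1, -1)

def recall(T, x, max_iter=10):
    """Iterate x <- Sgn(T x) until the state repeats; return the visited states."""
    states = [np.asarray(x)]
    for _ in range(max_iter):
        nxt = sgn(T @ states[-1])
        states.append(nxt)
        if np.array_equal(nxt, states[-2]):
            break
    return states

def word(v):
    """Integer label of a bipolar vector, most significant bit first."""
    return int("".join("1" if b > 0 else "0" for b in v), 2)

v13 = np.array([1, 1, -1, 1])
T13 = np.outer(v13, v13) - np.eye(4, dtype=int)   # [13] of Eq (13)

cues = [(0, 1, -1, 1), (0, 0, -1, 1), (0, 0, 0, 1), (0, 0, 0, 0)]
paths = {cue: [word(s) for s in recall(T13, cue)[1:]] for cue in cues}
for cue, path in paths.items():
    print(cue, "->", path)
```

Each trajectory ends on a repeated label, which marks the fixed point: word 13 for one, two or three missing bits, and its complement, word 2, when all four bits are missing.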

The following sequence of successively reversing bipolar bits illustrates tolerance
of erroneous bits.

(1) one erroneous bit.

    $[13]\,(-1,\ 1,\ -1,\ 1)^T = \mathrm{Sgn}(3,\ 1,\ -1,\ 1)^T = |13\rangle$    (20)

(2) two erroneous bits.

    $[13]\,(-1,\ -1,\ -1,\ 1)^T = \mathrm{Sgn}(1,\ 1,\ 1,\ -1)^T = |14\rangle$

    $[13]^2\,(-1,\ -1,\ -1,\ 1)^T = \mathrm{Sgn}(-1,\ -1,\ -1,\ 1)^T = |1\rangle$

    $[13]^3\,(-1,\ -1,\ -1,\ 1)^T = \mathrm{Sgn}(1,\ 1,\ 1,\ 1)^T = |15\rangle$

    $[13]^4\,(-1,\ -1,\ -1,\ 1)^T = \mathrm{Sgn}(1,\ 1,\ -3,\ 1)^T = |13\rangle$    (21)

(3) three erroneous bits.

    $[13]\,(-1,\ -1,\ 1,\ 1)^T = \mathrm{Sgn}(-1,\ -1,\ 1,\ -3)^T = |2\rangle$    (22)

which also converges to a fixed point that is the bipolar complement of |13>.

Generalization within a subspace 

We consider the ability to recognize a new vector that is different from the stored vector.
In other words, an AM can recognize related vectors that have not been memorized before. By
recognition, we mean convergence to a different fixed point. In this sense, we say that the AM can
generalize its memory to include other fixed points.

In the case of bipolar vectors, if and only if a new vector x is orthogonal to the stored
vectors, associative recall "converges in a cycle of two" as defined in the following iterations:

    $\mathrm{Sgn}([T]\,|x\rangle) = -|x\rangle$    (23a)

    $\mathrm{Sgn}(-[T]\,|x\rangle) = +|x\rangle$    (23b)

This necessary and sufficient condition allows us to determine efficiently the orthogonality
between a new vector and all the stored vectors.

We shall show that when a new vector |11> is presented to the AM [13], due to the
orthogonality between |13> and |11> and the traceless property of [13],

    $[13]\,|11\rangle = \mathrm{Sgn}(-|11\rangle) = |4\rangle$, and

    $[13]^2\,|11\rangle = |11\rangle$    (24)


Once the system has acknowledged the second vector |11>, it is incorporated into the
matrix storage.

    $4\,[T_2] = [13, 11] = [13] + [11] = |13\rangle\langle 13| + |11\rangle\langle 11| - 2\delta$    (25)

If another vector, |7>, is presented,

    $[13, 11]\,|7\rangle = \mathrm{Sgn}(-2\,|7\rangle) = |8\rangle$, and

    $[13, 11]^2\,|7\rangle = \mathrm{Sgn}(4\,|7\rangle) = |7\rangle$    (26)

Thus, we enlarge the memory storage to have three memorized states.

    $4\,[T_3] = [13, 11, 7] = [13] + [11] + [7] = |13\rangle\langle 13| + |11\rangle\langle 11| + |7\rangle\langle 7| - 3\delta$    (27)

This process is continued until the 4-bit orthogonal subspace (p = 3) is filled up.

    $4\,[T_4] = [13] + [11] + [7] + [14]$    (28)

We have demonstrated the ability to include other orthogonal vectors that have not been 
stored before. This example also shows the important consequence of traceless storage through its 
contribution to the "generalization by interpolation within the orthogonal subspace". 

Given a table of orthogonal vectors, one may argue that computing inner products will 
also determine orthogonality. However, inner products must be done pairwise among all vectors 
and become inefficient as the number of vectors gets large. The above method remains efficient for 
all sizes. 
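
The incremental procedure of Eqs (23) to (28) can be sketched as follows. This is an illustration only, assuming NumPy and synchronous updates; `memory` and `is_new_orthogonal` are hypothetical helper names, and word 15 is included as a candidate that should be rejected because it is not orthogonal to |13>.

```python
import numpy as np

def sgn(x):
    """+1 for positive entries, -1 for zero or negative entries."""
    return np.where(np.asarray(x) > 0, 1, -1)

def memory(vectors):
    """Unnormalized traceless storage: sum_m |m><m| - M * identity."""
    V = np.asarray(vectors)
    return V.T @ V - len(V) * np.eye(V.shape[1], dtype=int)

def is_new_orthogonal(T, x):
    """Two-cycle test of Eqs (23a) and (23b): Sgn(T x) = -x and Sgn(T (-x)) = +x."""
    x = np.asarray(x)
    return np.array_equal(sgn(T @ x), -x) and np.array_equal(sgn(T @ (-x)), x)

stored = [np.array([1, 1, -1, 1])]           # start from the single word |13>
accepted = [13]
candidates = {15: [1, 1, 1, 1],              # not orthogonal to |13>
              11: [1, -1, 1, 1],
              7: [-1, 1, 1, 1],
              14: [1, 1, 1, -1]}

for label, vec in candidates.items():
    if is_new_orthogonal(memory(stored), vec):
        stored.append(np.array(vec))         # enlarge the storage, Eqs (25) and (27)
        accepted.append(label)

print("accepted words:", accepted)
```

Each test costs one matrix-vector product per direction regardless of how many vectors are already stored, which is the efficiency argument made in the text.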


One may furthermore argue that the difficulty is not how to construct an orthogonal set, but
how to select critical bipolar features from gray-scale, imperfect images.

Algorithms for Constructing a Critical Feature:

We shall not rely on the auto-AM to select features. One can carry out one's favorite
image processing procedure to extract a set of gray-scale feature vectors, {|F>}. Bipolar feature
vectors are preferred in AM because of the demonstrated fault tolerance and the special ability of
the traceless outer product that allows a quick convergence to a fixed point of cycle two. Given a
gray-scale feature vector |F>, several procedures for generating a bipolar feature vector are given. The
first procedure is "bipolarization", i.e.,

    $|f\rangle = \mathrm{Sgn}(|F\rangle - \text{threshold})$    (29)

The second procedure is to use the Walsh transform. We apply the two-dimensional Walsh
transform (as an orthogonal bipolar vector space {|w_j>}) to all gray-scale features. We select the
bipolar feature vector from a specific Walsh base vector that is associated with the maximum
coefficient in the Walsh transform.

    $|f\rangle = \mathrm{Sgn}(\mathrm{Max}_j(|w_j\rangle\langle w_j|F\rangle) - \text{threshold})$    (30)

where the orthonormality condition of Walsh base vectors is inserted to relate to the first method:

    $\sum_j |w_j\rangle\langle w_j| = [1]$    (31)

The third and the fourth procedures are to extract from the arbitrary feature vector |G>
the closest vector |g> from either the bipolar orthogonal feature set {|N>} or the set {|F>} using the
following traceless associative memory storage.

    $|g\rangle = \mathrm{Sgn}([\sum |N\rangle\langle F|]\,|G\rangle - \text{threshold})$    (32)

    $|g\rangle = \mathrm{Sgn}(\sum c_F\,[|F\rangle\langle F|]\,|G\rangle - \text{threshold})$    (33)

The linear combination coefficients {c_F} may be determined by the statistics of single
character distortions and variances (similar to finding the normal modes that diagonalize the
covariance matrix, and to the Karhunen-Loeve orthogonal procedure used for the outer product
representation of 2-D imagery). Furthermore, the statistics of character pair distortions, such as
two scripted zeros, could be used to determine the coefficients so as to resolve the problem of
recognizing connected characters and broken characters after segmentation. We will not go into
details of this approach, because of its problem-dependent nature.

The mechanism to select critical features is given as follows.

(1) A human being picks a critical feature (picture) among the set of distorted, handwritten
characters, e.g. the extra stroke among O, P, Q.

(2) Walsh transform the selected feature.

(3) Pick the Walsh function that has the largest transform value.

We choose a feature vector that is closest to the Walsh vector associated with the largest Walsh
transform coefficient, and the rest follows from the procedure described in Eqs (24-28). We call such
a set of features the critical features.
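
Steps (2) and (3), together with the bipolarization of Eq (29), can be sketched as below. This is a hedged illustration, not the authors' implementation: it assumes NumPy, uses a one-dimensional Walsh-Hadamard matrix in Sylvester (Hadamard) ordering instead of the two-dimensional transform mentioned in the text, subtracts the mean of the feature before transforming so that the constant base vector does not dominate, and uses a made-up gray-scale feature `F`. The names `hadamard`, `bipolarize` and `walsh_feature` are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Walsh-Hadamard matrix (n a power of two)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.kron(np.array([[1, 1], [1, -1]]), H)
    return H

def bipolarize(F, threshold=0.0):
    """Eq (29): |f> = Sgn(|F> - threshold), with zero mapped to -1."""
    return np.where(np.asarray(F, dtype=float) - threshold > 0, 1, -1)

def walsh_feature(F):
    """Return (index, base vector) of the largest Walsh transform coefficient."""
    F = np.asarray(F, dtype=float)
    H = hadamard(len(F))
    coeffs = H @ F                     # <w_j | F> for every base vector w_j
    j = int(np.argmax(coeffs))
    return j, H[j]

F = np.array([0.9, 0.2, 0.8, 0.1])    # hypothetical gray-scale feature vector
f_threshold = bipolarize(F, threshold=0.5)
j, f_walsh = walsh_feature(F - F.mean())   # mean removal is an assumption of this sketch

print(f_threshold, j, f_walsh)
```

For this particular input the thresholded vector and the selected Walsh base vector coincide; in general they differ, and the Walsh vector then serves as the orthogonal feature to be stored.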

Lessons to be learned about applying associative memory to pattern recognition:

AM can only do so much. There is no way to judge the correctness of an associative recall
except by the convergence to a fixed point. One can only assign meaning to those fixed points,
whether they are new or old. The proven capabilities of the AM model are (1) recovery of missing
and erroneous bits, and (2) the creation of new orthogonal vectors, as illustrated above. Therefore, to
apply AM to pattern recognition, one must apply human interpretations to those capabilities.

Since learning is by trial and error, it is a continuous process. Suppose that a feature
vector with many components representing many features (such as leg-feature and fur-feature, etc.,
for a tiger, coded fully as |13>) has been memorized by the traceless outer product. Furthermore,
suppose that only certain features are known in a sequence of imperfect input vectors (i.e., some
feature values are missing; e.g., the first in the sequence is (0, 0, 1, 1)). Then, the AM can fill in
the missing bits. After three iterations, one finds (-1, -1, 1, -1)^T = |2>. One can then enlarge the
traceless outer product memory to include both vectors, [13, 2]. One examines the second input
vector (0, 0, 1, 1). One can verify that the enlarged memory can indeed recall the vector |2>,
which corresponds to, say, a lady, rather than a tiger. The AM "mental" capacity of recognizing
other distinct objects when they show up has been demonstrated. Following this line of thought,
different subspaces of different sizes could be assigned to different classes of objects related by a
hetero-associative memory of a rectangular matrix. Such a recognition of different classes requires
a complete feature set coded in the AM. It can fill all orthogonal subspaces by the "generalization
procedure" illustrated in Eqs (24-28).

3. ATTENTIVE ASSOCIATIVE MEMORY 


Recently, Amari et al. have studied the dynamics of such a system, for which we will give a
simple theorem. We summarize our model equations as follows:

    $\langle n | m \rangle = B\,\delta_{n,m}$    (34)

    $[T]\,|n\rangle = \lambda_n\,|n\rangle$    (35)

The simple model of attentive associative memory [T] is a linear combination of outer products
based on the set of orthogonal feature vectors, {|n>, n = 1, ..., M}, and a cue of initial state |Q> that
determines the set of attention parameters {a_n} as follows:

    $a_n = \langle n | n \rangle - \langle n | Q \rangle$    (36)

    $B\,[T]_{ij} = \sum_{n=1}^{M} a_n\,|n_i\rangle\langle n_j| - A\,[\delta_{ij}]$    (37)

that is traceless, $\mathrm{Tr}\,\delta_{ij} = \mathrm{Tr}\,|n_i\rangle\langle n_j| = B$, giving

    $A = \sum_{n=1}^{M} a_n$    (38)

and

    $\lambda_n = a_n - (A/B)$    (39)

The attentive memory capacity A and eigenvalue $\lambda_n$ are reduced to Hopfield's memory capacity M
and a degenerate eigenvalue $\lambda$ in the case of uniform attention (i.e. $a_n = 1$),

    $A = M, \quad \lambda = 1 - r$    (40)

where Amari's pattern ratio r = (M/B) is defined for M bipolar words (states) of B bits (neurons)
each.
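
The non-white spectrum of Eq (39) and its uniform-attention limit can be verified numerically. The sketch below assumes NumPy; `attentive_memory` is a hypothetical helper, and the attention values are chosen arbitrarily for illustration rather than computed from a cue |Q> through Eq (36).

```python
import numpy as np

def attentive_memory(vectors, attention):
    """[T] = (sum_n a_n |n><n| - A * identity) / B, with A = sum_n a_n (Eqs 37, 38)."""
    V = np.asarray(vectors, dtype=float)
    a = np.asarray(attention, dtype=float)
    B = V.shape[1]
    A = a.sum()
    return (V.T @ (a[:, None] * V) - A * np.eye(B)) / B

features = np.array([[1, 1, -1, 1],    # |13>
                     [1, -1, 1, 1],    # |11>
                     [-1, 1, 1, 1]])   # |7>
M, B = features.shape

a = np.array([2.0, 1.0, 1.0])          # illustrative attention: extra weight on |13>
T = attentive_memory(features, a)
eigenvalues = a - a.sum() / B          # Eq (39)

T_uniform = attentive_memory(features, np.ones(M))
lam_uniform = 1 - M / B                # uniform attention, 1 - r

print("trace:", np.trace(T))
print("attentive eigenvalues:", eigenvalues)
print("uniform eigenvalue:", lam_uniform)
```

Raising one attention parameter above 1 lifts the eigenvalue of that state above the uniform value 1 - r while lowering the others, which is the mechanism behind the faster convergence claimed for the attended state.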


The dynamics is assumed to be governed by the matrix-vector inner product

    $Q(t + 1) = \mathrm{Sgn}([T]\,Q(t))$    (41)

where the point nonlinearity function is defined as Sgn(x) = +1 if x > 0, and -1 if x ≤ 0. The
successive associative recall gives the iteration, indexed by t = 0, 1, 2, ..., such that Q(t) = Q when
t = 0. The eigenvalue spectrum, not the distance alone, is a proper macroscopic parameter to explain
the transient dynamical behavior of the recalling process. In particular, the direction cosine

    $S_m(t) = \langle m | Q(t) \rangle / \langle m | m \rangle$    (42)

has been derived and the logarithmic derivative is given by

    $(d/dt)\,\log(1 - S_m(t)) < \log(\lambda_m / 2) < 0$    (43)

Convergence to a specific m-th state is guaranteed if the m-th eigenvalue ($\lambda_m$) is bounded, $2 > \lambda_m$.

Theorem 1 about the lower bound says that paying attention (i.e. non-uniform $a_n > 1$)
always increases the memory capacity, $A = \sum_{n=1}^{M} a_n > M$, with a faster convergence rate
proportional to the eigenvalue $\lambda_m > \lambda = 1 - r$.

We conjecture that the statistical neurodynamics of associative memory may have similar behavior
to the deterministic dynamics of attentive associative memory with a non-white eigenvalue
spectrum, due to random initial conditions that change with respect to the initial guess vector |Q>,
t = 0. The difference vector between |Q(t)> and |m> has an inner product norm defined as

    $2\,D_m(t) = \langle m | m \rangle - \langle m | Q(t) \rangle$    (44)

If we assume that paying attention to the initial small guess error $2\,D_m(0)$ amounts to choosing a
non-uniform and biased storage

    $a_m = 2\,D_m(0) > 1$    (45)

and all other coefficients to be identical to 1,

    $a_n = 1, \quad n \ne m$,    (46)

then by definition

    $A = M + 2\,D_m(0) - 1$.    (47)

Theorem 2 about the upper bound of $\lambda_m$ assumes that if a small difference vector between
the input |Q> and the specific state |m> is used as the attention parameter $a_m$, Eq (31a), then the
critical relationship between Amari's pattern ratio r and the initial error is analytically found
for successful recalls.

    $2\,D_m(0) < 2 + (M + 1)/(B - 1)$    (48)

The maximum permissible Hamming distance $D_H$ from the desired m-th state, to be
reached after iterative recalls, is given by the formula

    $D_H < (B/2) - 1 - [(M - 1)/2(B + 1)] \approx (B/2) - 1 - (r/2)$    (49)
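
One way to see where the bound of Eq (48) comes from is the short derivation below. It is a sketch that assumes the convergence condition is the upper bound $\lambda_m < 2$ read off from Eq (43), and it substitutes Eqs (39), (45) and (47).

```latex
\begin{align*}
\lambda_m &= a_m - \frac{A}{B}
           = 2D_m(0) - \frac{M + 2D_m(0) - 1}{B} < 2 \\
2D_m(0)\,\frac{B-1}{B} &< 2 + \frac{M-1}{B} \\
2D_m(0) &< \frac{2B + M - 1}{B - 1} = 2 + \frac{M+1}{B-1}
\end{align*}
```

For example, with B = 4 and M = 1 the right-hand side is 2 + 2/3, so the attended initial error must satisfy $2D_m(0) < 8/3$.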
4. CONCLUSION

Associative memory (AM) works like a matched filter, but does so efficiently. It should not
be applied to the image domain directly. Rather, it should be applied to the feature domain so that a
relatively small AM can do useful tasks with present technology.

We shall not rely on the auto-AM to select features. Instead, features should be selected
using human judgement. However, auto-AM will help us find critical features, and hetero-
associative memory can perform feature extraction efficiently.

There exists a large body of knowledge in the literature pertaining to feature selection and
extraction and pattern classification for traditional optical character recognition. This body of
knowledge should be tapped and coupled with associative memory. One should not rule out
the use of traditional classification techniques (such as syntactical ones) for extraction of high-level
features, which then become part of the input feature vector to an AM.

Classical pattern recognition has been demonstrated with relatively greater success in
machine-printed character recognition than in handprinted character recognition.
The difficulty may be rooted in the lack of generalization and abstraction due to the machine's limited
one-dimensional knowledge representation. In principle, AM should be able to complement
traditional OCR with a 2-D knowledge representation. Various degrees of abstraction can be
achieved through a multi-layer, two-dimensional AM architecture. Note that present
technology has evolved to the point where 2-D memory (chip or optical disk) is not more
expensive than 1-D memory storage with a logic unit tree content addressable memory
processor.


In conclusion, we can combine the traditional wisdom of traditional OCR with simple AM
implementable in present technology to form a human-intelligence-endowed neural
network.

Character segmentation is an important step in character recognition. Fukushima has
developed a neural network model (selective attention) for character segmentation in the
Neocognitron [Fukushima (1987)]. The attentive associative memory model implemented
opto-electronically by Athale, Szu & Friedlander (1986) can be augmented by an a priori probability
compiled by a character-pair correlation function of connected characters. This is an interesting
area for more research.

Inputs to associative memory are linear vectors whereas inputs to OCR are rectangular
arrays. Can associative memory replicate the concept of a (2-D) neighborhood? A two-
dimensional transform that preserves the neighborhood relationship should be used for image
pre-processing before applying AM to the pattern. For example, the 2-D Walsh transform can give
a 1-D base Walsh vector (associated with the largest coefficient) as the input feature vector to the
AM.

Can AM perform syntactical parsing [Ali and Pavlidis (1977)] or rule-based structural
analysis [D'Amato (1982)]? Any traditional classification technique can be used to extract high-
level features for AM.

How can AM extract position and rotation invariant features? [cf. Szu (1986), Messner and
Szu (1987)].

One difficulty in applying the backpropagation network has been the network size-scaling
problem. One way to circumvent it has been to extract a small number of features as input
[Burr (1987), Gullichsen and Chang (1987)]. Recent advances by Ballard in 1987 permit partial
connectivity between two successive layers, which avoids the combinatorial explosions often
encountered when the input layer is directly connected to image pixels. Thus, spatial pattern
relationships can be efficiently preserved in such a network, while coarse-graining between
successive layers can desensitize the network to pattern variation in input images.

An AI extension of the simple AM model is attentive associative memory (AAM), which
allows us to apply AI to pay non-uniform attention to each term of the outer product storage, i.e.
a linear combination of outer products in which the set of combination coefficients is
determined by an AI rule-based system, e.g. from the frequency distribution of distorted characters
according to user group profiles such as left-handed versus right-handed writing. The efficiency
of the closed system of rule-based knowledge representation of AI using tuple storage is
combined with the flexibility of the non-rule-based open system using the matrix knowledge
representation of NI (coined for either neural, or network, or natural intelligence). Thus, the
ability of generalization and abstraction becomes possible for AI, and is demonstrated in a
combined intelligent system of AI & NI. We can endow a simple neural network architecture
based on a small set of neurons with a human-like intelligence by combining the traditional
rule-based AI wisdom with non-rule-based learning. This is achievable because OCR requires
better feature vectors obtained from other disciplines, in the sense of the fault tolerance that neural
networks built with present technology can already provide.
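
The non-uniform linear combination of outer products can be sketched as follows. This is a
minimal illustration under stated assumptions, not the authors' implementation: feature vectors
are taken to be orthonormal, and the attention coefficients are simply supplied as numbers,
standing in for whatever a rule-based system would derive from user-group statistics. The names
write_aam and read_aam are hypothetical.

```python
import numpy as np

def write_aam(features, targets, attention):
    """Write by outer products: M = sum_k a_k * outer(y_k, x_k).

    features  : (K, n) array, one input feature vector x_k per row
    targets   : (K, m) array, one associated output vector y_k per row
    attention : (K,) array of combination coefficients a_k (assumed to come
                from a rule-based system; here they are just given numbers)
    """
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    attention = np.asarray(attention, dtype=float)
    return np.einsum("k,ki,kj->ij", attention, targets, features)

def read_aam(memory, probe):
    """Read by inner products: each row of M is matched against the probe."""
    return memory @ np.asarray(probe, dtype=float)

# Hypothetical example: three orthonormal features, bipolar targets, and
# attention weights that favour the first stored pair.
features = np.eye(3)
targets = np.array([[1, -1], [-1, 1], [1, 1]])
attention = np.array([0.5, 0.3, 0.2])
memory = write_aam(features, targets, attention)
recalled = np.sign(read_aam(memory, features[0]))
```

With orthonormal features, probing with x_k returns a_k * y_k, so the attention coefficient scales
the strength of the recall while thresholding recovers the stored pattern. Setting all a_k equal
reduces the scheme to the uniform AM.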

Appendix: Generic Definition of Neural Networks 

Associative memory is a special model of neural networks. Examples of associative 
recalls from partial images and the success of nonlinear signal processing are recorded in the 
literature [cf. Kohonen (1984)]. An axiomatic definition is outlined as follows. 

We shall define three kinds of neurons: fine-grained, medium-grained and large-grained
processor elements (PEs). A fine-grained PE, represented by the lower-case word neuron, has
no internal memory, analogous to neurons in the hippocampus, the part of the brain that is
responsible for fault-tolerant associative recall. A medium-grained PE, Neuron, has a built-in
memory, analogous to Neurons in biological sensory and motor control which are responsible
for reactions to approaching danger. A large-grained PE, NEURON, has built-in memory,
control logic, and communication capabilities equivalent to a computer. NEURONs occur in
nature in the form of grandmother cells or pacer/conductor cells.

These three types of neurons and their associated circuits have four kinds of interactions: 
(1) exciting, (2) inhibiting, (3) bursting, (4) grading and delaying transmission. In general they 
follow the law of the middle response or the sigmoid function (hyperbolic tangent or logistic 
functions) to amplify weak signals with a nonlinear, quickly rising function and suppress strong
signals with a nonlinear, tapering-off saturation function. The generic definition of a Neural
Network is a system which is: 

1. Non-linear ≈ sigmoid function ≈ point non-linearity (hard limiting), shown as
follows:

[figure not reproduced]

2. Non-local ≈ weighted outer product ≈ outer product (white spectrum), shown as
follows:

[figure not reproduced]

3. Non-stationary ≈ piecewise time stationary ≈ iterative algorithm, shown as follows:

[figure not reproduced]

4. Non-convex ≈ constrained global optimization ≈ simulated annealing, schematically
shown as follows:

[figure not reproduced]

5. Other attributes yet to be discovered.
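
The first attribute, and the "law of the middle response" stated before the list, can be
illustrated with a short sketch. It assumes a hyperbolic-tangent sigmoid with an adjustable gain
parameter; the gain values and the names sigmoid and hard_limit are illustrative choices, not
taken from the paper.

```python
import numpy as np

def sigmoid(x, gain=1.0):
    """Hyperbolic-tangent sigmoid; gain is an assumed slope parameter."""
    return np.tanh(gain * np.asarray(x, dtype=float))

def hard_limit(x):
    """Point non-linearity: +1 for non-negative input, -1 otherwise."""
    return np.where(np.asarray(x, dtype=float) >= 0, 1.0, -1.0)

weak, strong = 0.1, 10.0
weak_out = sigmoid(weak, gain=5.0)      # larger than the input: amplified
strong_out = sigmoid(strong, gain=5.0)  # bounded by 1: saturated

# As the gain grows, the sigmoid approaches the hard limiter (away from 0).
samples = np.array([-2.0, -0.5, 0.5, 2.0])
steep = sigmoid(samples, gain=1e3)
```

Weak inputs fall on the steep central part of the curve and are amplified, strong inputs fall on
the flat tails and are suppressed, and in the high-gain limit the sigmoid reduces to the hard
limiter named in item 1.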

These successive approximations of the four "non-" principles, indicated by wiggly equal
signs in (1-4), make possible the unveiling of complex and nonlinear neural (brain)
behavior. This is possible with the use of powerful computers and more accurate models of
intelligent functions. The theory is amenable to numerical simulations due to piecewise
linear, regionally local, temporally stationary, and locally convex approximations.
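
The simulated annealing named in item 4 can be sketched numerically as follows. This is a
generic textbook form given under stated assumptions, not the authors' algorithm: a
one-dimensional energy, Gaussian proposal steps, and an inverse-logarithmic cooling schedule.
The double_well energy and all parameter values are hypothetical.

```python
import math
import random

def simulated_annealing(energy, x0, steps=5000, t0=1.0, step_size=0.5, seed=0):
    """Minimise a 1-D energy function; returns (best_x, best_energy)."""
    rng = random.Random(seed)
    x = best_x = x0
    e = best_e = energy(x0)
    for k in range(1, steps + 1):
        t = t0 / math.log(1 + k)              # assumed cooling schedule
        cand = x + rng.gauss(0.0, step_size)  # assumed Gaussian proposal
        e_cand = energy(cand)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-dE / T), which lets the search leave local minima.
        if e_cand <= e or rng.random() < math.exp(-(e_cand - e) / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

def double_well(x):
    """Hypothetical non-convex energy with two minima of different depth."""
    return x**4 - 3 * x**2 + x

# Start near the shallower minimum on the positive side.
best_x, best_e = simulated_annealing(double_well, x0=1.1)
```

At high temperature the search moves almost freely across the energy barrier; as the temperature
falls it behaves like a local descent, which corresponds to the locally convex approximation
mentioned above.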

Three decades ago, Rosenblatt and co-workers built the perceptron solely based upon the
first attribute (nonlinearity) with stochastic implementations. Thus, with hindsight, it was not
surprising that Minsky and Papert could show a limited utility and propose useful alternative
artificial intelligence (AI) rule-based systems. AI works in closed systems where rules govern,
while neural intelligence (NI) works in open systems where rules have yet to be discovered.
Various exploitations of these efforts in neural networks are:




The term wet-ware, coined by Carver Mead, is neither software nor hardware, but more like
Hecht-Nielsen's net-ware based on non-programmable but trainable networks. A special version
of layered neural networks has been demonstrated with the ability of phonetic interpolation in
the Rumelhart and Sejnowski connectionist networks, such as Net-Talk, Boltzmann and Cauchy
Machines, and error back-propagation networks.